{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Similarity search with ApproxRetrievalStrategy and NLP model\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/integrations/langchain/llangchain-vector-store-approx-search.ipynb)\n",
    "\n",
    "This workbook demonstrates how to perform a similarity search using [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) and [ApproxRetrievalStrategy](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.ApproxRetrievalStrategy). First we will download sample dataset and  split documents into chunks using langchain and then index into elasticsearch through [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).\n",
    "\n",
    "The [ApproxRetrievalStrategy](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.ApproxRetrievalStrategy) uses HNSW algorithm to find [nearest neighbor](https://en.wikipedia.org/wiki/Nearest_neighbor_search), which is the fastest and memory efficient algorithm. During the indexing, dense vector fields are generated and is store within the index. We can either provide an embedding function or provide a `query_model_id` to embed the query. In this notebook we will provide a `query_model_id` - during the query time. \n",
    "\n",
    "For our model, We will use [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to transform the search text into the dense vector.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install packages and import modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python3 -m pip install -qU langchain langchain-elasticsearch tiktoken sentence-transformers==2.7.0 \"eland<9\" transformers\n",
    "\n",
    "from getpass import getpass\n",
    "from langchain_elasticsearch import ElasticsearchStore\n",
    "from urllib.request import urlopen\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "import json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Connect to Elasticsearch\n",
    "\n",
    "ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.\n",
    "\n",
    "We'll use the Cloud ID to identify our deployment, because we are using Elastic Cloud deployment. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.\n",
    "\n",
    "We will use ElasticsearchStore to connect to our elastic cloud deployment, This would help create and index data easily. In the constructor, we will explicity mention `query_field` and `vector_query_field`, to map our inference pipeline to source and target fields.\n",
    "\n",
    "Note: For demonstration we will explicity set strategy to `ElasticsearchStore.ApproxRetrievalStrategy` although ElasticsearchStore uses `ApproxRetrievalStrategy` strategy by default.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
    "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
    "\n",
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
    "ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n",
    "\n",
    "vector_store = ElasticsearchStore(\n",
    "    es_cloud_id=ELASTIC_CLOUD_ID,\n",
    "    es_api_key=ELASTIC_API_KEY,\n",
    "    query_field=\"text_field\",\n",
    "    vector_query_field=\"vector_query_field.predicted_value\",\n",
    "    index_name=\"approx-search-demo\",\n",
    "    strategy=ElasticsearchStore.ApproxRetrievalStrategy(\n",
    "        query_model_id=\"sentence-transformers__all-minilm-l6-v2\"\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy model using Eland\n",
    "\n",
    "ℹ️ Once you have created elastic cloud deployment, [autoscale](https://www.elastic.co/guide/en/cloud/current/ec-autoscaling.html) to have least one machine learning (ML) node with enough (4GB) memory. Also ensure that the Elasticsearch cluster is running. \n",
    "\n",
    "\n",
    "We are using the [`eland`](https://www.elastic.co/guide/en/elasticsearch/client/eland/current/overview.html) tool to install a `text_embedding` model - [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).  The model will transfer your search query into vector which will be used for the search over the set of documents stored in Elasticsearch. \n",
    "Using the [`eland_import_hub_model`](https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html#ml-nlp-pytorch) script, we can download and install `all-MiniLM-L6-v2` transformer model. Setting the NLP `--task-type` as `text_embedding`. \n",
    "\n",
    "Authenticate your request to cloud deployment by provided cloud id, cloud username and password. Alternatively, You could also use [API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html#create-api-key) in place of username and password.  \n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!eland_import_hub_model --cloud-id $ELASTIC_CLOUD_ID --es-api-key $ELASTIC_API_KEY --hub-model-id sentence-transformers/all-MiniLM-L6-v2 --task-type text_embedding --start --clear-previous"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the sample dataset\n",
    "\n",
    "Let's download the sample dataset and deserialize the document to make document chunking easier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/chatbot-rag-app/data/data.json\"\n",
    "response = urlopen(url)\n",
    "\n",
    "workplace_docs = json.loads(response.read())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create ingestion pipeline\n",
    "\n",
    "We need to create a text embedding ingest pipeline to generate vector (text) embeddings for `text_field`.\n",
    "\n",
    "The pipeline below is defining a processor for the [inference](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html) to the NLP model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PIPELINE_ID = \"vectorize_workplace\"\n",
    "\n",
    "vector_store.client.ingest.put_pipeline(\n",
    "    id=PIPELINE_ID,\n",
    "    processors=[\n",
    "        {\n",
    "            \"inference\": {\n",
    "                \"model_id\": \"sentence-transformers__all-minilm-l6-v2\",\n",
    "                \"field_map\": {\"query_field\": \"text_field\"},\n",
    "                \"target_field\": \"vector_query_field\",\n",
    "            }\n",
    "        }\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Index with mappings\n",
    "\n",
    "We will now create an elasticsearch index with correct mapping before we index documents. \n",
    "We are adding `predicted_value` to `vector_query_field` field to store the vector embeddings. We will search over the content and would be mapped to the `text_field`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "xAkc1OVcOxy3",
    "outputId": "b2453634-89b8-48bc-ac65-a6a1c3b8170f"
   },
   "outputs": [],
   "source": [
    "# define index name\n",
    "INDEX_NAME = \"approx-search-demo\"\n",
    "\n",
    "# flag to check if index has to be deleted before creating\n",
    "SHOULD_DELETE_INDEX = True\n",
    "\n",
    "# define index mapping\n",
    "INDEX_MAPPING = {\n",
    "    \"properties\": {\n",
    "        \"text_field\": {\"type\": \"text\"},\n",
    "        \"vector_query_field\": {\n",
    "            \"properties\": {\n",
    "                \"is_truncated\": {\"type\": \"boolean\"},\n",
    "                \"predicted_value\": {\n",
    "                    \"type\": \"dense_vector\",\n",
    "                    \"dims\": 384,\n",
    "                    \"index\": True,\n",
    "                    \"similarity\": \"l2_norm\",\n",
    "                },\n",
    "            }\n",
    "        },\n",
    "    }\n",
    "}\n",
    "\n",
    "\n",
    "INDEX_SETTINGS = {\"index\": {\"default_pipeline\": PIPELINE_ID}}\n",
    "\n",
    "# check if we want to delete index before creating the index\n",
    "if SHOULD_DELETE_INDEX:\n",
    "    if vector_store.client.indices.exists(index=INDEX_NAME):\n",
    "        vector_store.client.indices.delete(index=INDEX_NAME, ignore=[400, 404])\n",
    "\n",
    "vector_store.client.indices.create(\n",
    "    index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS, ignore=[400, 404]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Split Documents into Passages\n",
    "\n",
    "Next, We will chunk these documents into 800 token passages with an overlap of 0 tokens using a text splitter, [RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html) and then create documents from these texts. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "metadata = []\n",
    "content = []\n",
    "\n",
    "# data.json\n",
    "for doc in workplace_docs:\n",
    "    content.append(doc[\"content\"])\n",
    "    metadata.append({\"name\": doc[\"name\"], \"summary\": doc[\"summary\"]})\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
    "    chunk_size=512, chunk_overlap=256\n",
    ")\n",
    "docs = text_splitter.create_documents(content, metadatas=metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Index data into elasticsearch\n",
    "\n",
    "Now that we have our document ready, next we will index data to elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents). We will use Cloud ID and Passwords values set in the Create cloud deployment step. \n",
    "\n",
    "We will  `query_model_id` to  `\"sentence-transformers__all-minilm-l6-v2\"` , to embed dense vectors.\n",
    "\n",
    "ℹ️ Note: Before you begin indexing, ensure you have [started your trained model deployment](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html) in your index.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = ElasticsearchStore.from_documents(\n",
    "    docs,\n",
    "    es_cloud_id=ELASTIC_CLOUD_ID,\n",
    "    es_api_key=ELASTIC_API_KEY,\n",
    "    index_name=\"approx-search-demo\",\n",
    "    query_field=\"text_field\",\n",
    "    vector_query_field=\"vector_query_field.predicted_value\",\n",
    "    strategy=ElasticsearchStore.ApproxRetrievalStrategy(\n",
    "        query_model_id=\"sentence-transformers__all-minilm-l6-v2\"\n",
    "    ),\n",
    "    bulk_kwargs={\n",
    "        \"request_timeout\": 60,\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying the dataset with similarity_search\n",
    "\n",
    "Now that we have indexed our sample data to elasticsearch, we will perform a similarity search on query - `How does the compensation work?`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Document(metadata={'summary': 'This document outlines a compensation framework for IT teams. It includes job levels, compensation bands, and performance-based incentives to ensure fair and competitive wages. Regular market benchmarking will be conducted to adjust the bands according to industry trends.', 'name': 'Compensation Framework For It Teams'}, page_content=\"Introduction:\\nThis document outlines the compensation bands strategy for the various teams within our IT company. The goal is to establish a fair and competitive compensation structure that aligns with industry standards, rewards performance, and attracts top talent. By implementing this strategy, we aim to foster employee satisfaction and retention while ensuring the company's overall success.\\n\\nPurpose:\\nThe purpose of this compensation bands strategy is to:\\na. Define clear guidelines for salary ranges based on job levels and market benchmarks.\\nb. Support equitable compensation practices across different teams.\\nc. Encourage employee growth and performance.\\nd. Enable effective budgeting and resource allocation.\\n\\nJob Levels:\\nTo establish a comprehensive compensation structure, we have defined distinct job levels within each team. These levels reflect varying degrees of skills, experience, and responsibilities. The levels include:\\na. Entry-Level: Employees with limited experience or early career professionals.\\nb. Intermediate-Level: Employees with moderate experience and demonstrated competence.\\nc. Senior-Level: Experienced employees with advanced skills and leadership capabilities.\\nd. Leadership-Level: Managers and team leaders responsible for strategic decision-making.\\n\\nCompensation Bands:\\nBased on the job levels, the following compensation bands have been established:\\na. Entry-Level Band: This band encompasses salary ranges for employees in entry-level positions. It aims to provide competitive compensation for individuals starting their careers within the company.\\n\\nb. Intermediate-Level Band: This band covers salary ranges for employees who have gained moderate experience and expertise in their respective roles. It rewards employees for their growing skill set and contributions.\\n\\nc. Senior-Level Band: The senior-level band includes salary ranges for experienced employees who have attained advanced skills and have a proven track record of delivering results. It reflects the increased responsibilities and expectations placed upon these individuals.\\n\\nd. Leadership-Level Band: This band comprises salary ranges for managers and team leaders responsible for guiding and overseeing their respective teams. It considers their leadership abilities, strategic thinking, and the impact they have on the company's success.\\n\\nMarket Benchmarking:\\nTo ensure our compensation remains competitive, regular market benchmarking will be conducted. This involves analyzing industry salary trends, regional compensation data, and market demand for specific roles. The findings will inform periodic adjustments to our compensation bands to maintain alignment with the market.\"), Document(metadata={'summary': 'This document outlines a compensation framework for IT teams. It includes job levels, compensation bands, and performance-based incentives to ensure fair and competitive wages. Regular market benchmarking will be conducted to adjust the bands according to industry trends.', 'name': 'Compensation Framework For It Teams'}, page_content=\"Compensation Bands:\\nBased on the job levels, the following compensation bands have been established:\\na. Entry-Level Band: This band encompasses salary ranges for employees in entry-level positions. It aims to provide competitive compensation for individuals starting their careers within the company.\\n\\nb. Intermediate-Level Band: This band covers salary ranges for employees who have gained moderate experience and expertise in their respective roles. It rewards employees for their growing skill set and contributions.\\n\\nc. Senior-Level Band: The senior-level band includes salary ranges for experienced employees who have attained advanced skills and have a proven track record of delivering results. It reflects the increased responsibilities and expectations placed upon these individuals.\\n\\nd. Leadership-Level Band: This band comprises salary ranges for managers and team leaders responsible for guiding and overseeing their respective teams. It considers their leadership abilities, strategic thinking, and the impact they have on the company's success.\\n\\nMarket Benchmarking:\\nTo ensure our compensation remains competitive, regular market benchmarking will be conducted. This involves analyzing industry salary trends, regional compensation data, and market demand for specific roles. The findings will inform periodic adjustments to our compensation bands to maintain alignment with the market.\\n\\nPerformance-Based Compensation:\\nIn addition to the defined compensation bands, we emphasize a performance-based compensation model. Performance evaluations will be conducted regularly, and employees exceeding performance expectations will be eligible for bonuses, incentives, and salary increases. This approach rewards high achievers and motivates employees to excel in their roles.\\n\\nConclusion:\\nBy implementing this compensation bands strategy, our IT company aims to establish fair and competitive compensation practices that align with market standards and foster employee satisfaction. Regular evaluations and market benchmarking will enable us to adapt and refine the strategy to meet the evolving needs of our organization.\"), Document(metadata={'summary': 'This Performance Management Policy outlines a consistent and transparent process for evaluating, recognizing, and rewarding employees. It includes goal setting, ongoing feedback, performance evaluations, ratings, promotions, and rewards. The policy applies to all employees and encourages open communication and professional growth.', 'name': 'Performance Management Policy'}, page_content='Performance Ratings\\nBased on the performance evaluation, employees will receive a performance rating that reflects their overall performance during the cycle. The rating system should be clearly defined and consistently applied across the organization. Performance ratings will be used to inform decisions regarding promotions, salary increases, and other rewards or recognition.\\nPromotions and Advancements\\nHigh-performing employees who consistently demonstrate strong performance, leadership, and a commitment to the company’s values may be considered for promotions or other advancement opportunities. Promotions will be based on factors such as performance ratings, skills, experience, and the needs of the organization. Employees interested in pursuing a promotion should discuss their career goals and development plans with their supervisor.\\nPerformance Improvement Plans\\nEmployees who receive a low performance rating or are struggling to meet their performance goals may be placed on a Performance Improvement Plan (PIP). A PIP is a structured plan designed to help the employee address specific areas of concern, set achievable improvement goals, and receive additional support or resources as needed. Employees on a PIP will be closely monitored and re-evaluated at the end of the improvement period to determine if satisfactory progress has been made.\\nRecognition and Rewards\\nOur company believes in recognizing and rewarding employees for their hard work and dedication. In addition to promotions and salary increases, employees may be eligible for other forms of recognition or rewards based on their performance. This may include bonuses, awards, or other incentives designed to motivate and celebrate employee achievements. The specific criteria and eligibility for these rewards will be communicated by the HR department or management.'), Document(metadata={'summary': ': This policy outlines the guidelines and procedures for requesting and taking time off from work for personal and leisure purposes. Full-time employees accrue vacation time at a rate of [X hours] per month, equivalent to [Y days] per year. Vacation requests must be submitted to supervisors at least', 'name': 'Company Vacation Policy'}, page_content=\"Purpose\\n\\nThe purpose of this vacation policy is to outline the guidelines and procedures for requesting and taking time off from work for personal and leisure purposes. This policy aims to promote a healthy work-life balance and encourage employees to take time to rest and recharge.\\nScope\\n\\nThis policy applies to all full-time and part-time employees who have completed their probationary period.\\nVacation Accrual\\n\\nFull-time employees accrue vacation time at a rate of [X hours] per month, equivalent to [Y days] per year. Part-time employees accrue vacation time on a pro-rata basis, calculated according to their scheduled work hours.\\n\\nVacation time will begin to accrue from the first day of employment, but employees are eligible to take vacation time only after completing their probationary period. Unused vacation time will be carried over to the next year, up to a maximum of [Z days]. Any additional unused vacation time will be forfeited.\\nVacation Scheduling\\n\\nEmployees are required to submit vacation requests to their supervisor at least [A weeks] in advance, specifying the start and end dates of their vacation. Supervisors will review and approve vacation requests based on business needs, ensuring adequate coverage during the employee's absence.\\n\\nEmployees are encouraged to plan their vacations around the company's peak and non-peak periods to minimize disruptions. Vacation requests during peak periods may be subject to limitations and require additional advance notice.\\nVacation Pay\\n\\nEmployees will receive their regular pay during their approved vacation time. Vacation pay will be calculated based on the employee's average earnings over the [B weeks] preceding their vacation.\\nUnplanned Absences and Vacation Time\\n\\nIn the event of an unplanned absence due to illness or personal emergencies, employees may use their accrued vacation time, subject to supervisor approval. Employees must inform their supervisor as soon as possible and provide any required documentation upon their return to work.\\nVacation Time and Termination of Employment\\n\\nIf an employee's employment is terminated, they will be paid out for any unused vacation time, calculated based on their current rate of pay.\\nPolicy Review and Updates\\n\\nThis vacation policy will be reviewed periodically and updated as necessary, taking into account changes in labor laws, business needs, and employee feedback.\\nQuestions and Concerns\\n\\nEmployees are encouraged to direct any questions or concerns about this policy to their supervisor or the HR department.\")]\n"
     ]
    }
   ],
   "source": [
    "results = vector_store.similarity_search(\"How does compensation work\")\n",
    "\n",
    "print(results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next steps\n",
    "\n",
    "Now you know how to integrate LangChain with Elasticsearch vector store, using your choice of NLP model and run similarity search on the indexed dataset to get the top 4 similar content. \n",
    "\n",
    "Next, checkout our [langchain-vector-store](https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/langchain/langchain-vector-store.ipynb) notebook to see how to query and filter on the metadata."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.3"
  },
  "vscode": {
   "interpreter": {
    "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
